Fusion of Range and Stereo Data for High-Resolution Scene-Modeling
This work has received funding from the Agence Nationale de la Recherche under the MIXCAM project (ANR-13-BS02-0010-01). Georgios Evangelidis is the corresponding author.
Tracking Objects as Points
Tracking has traditionally been the art of following interest points through
space and time. This changed with the rise of powerful deep networks. Nowadays,
tracking is dominated by pipelines that perform object detection followed by
temporal association, also known as tracking-by-detection. In this paper, we
present a simultaneous detection and tracking algorithm that is simpler,
faster, and more accurate than the state of the art. Our tracker, CenterTrack,
applies a detection model to a pair of images and detections from the prior
frame. Given this minimal input, CenterTrack localizes objects and predicts
their associations with the previous frame. That's it. CenterTrack is simple,
online (no peeking into the future), and real-time. It achieves 67.3% MOTA on
the MOT17 challenge at 22 FPS and 89.4% MOTA on the KITTI tracking benchmark at
15 FPS, setting a new state of the art on both datasets. CenterTrack is easily
extended to monocular 3D tracking by regressing additional 3D attributes. Using
monocular video input, it achieves 28.3% AMOTA@0.2 on the newly released
nuScenes 3D tracking benchmark, substantially outperforming the monocular
baseline on this benchmark while running at 28 FPS.
Comment: ECCV 2020 camera-ready version. Updated track rebirth results. Code available at https://github.com/xingyizhou/CenterTrack
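To make the association step concrete, here is a minimal sketch of greedy matching with predicted center offsets, the mechanism the abstract describes. It is not the authors' implementation; the function name, input shapes, and the distance threshold are all assumptions.

```python
import numpy as np

def associate(prev_centers, curr_centers, pred_offsets, max_dist=50.0):
    """Greedily match each current detection to a previous-frame center after
    undoing its predicted motion offset; unmatched detections start new tracks.

    prev_centers : (M, 2) object centers detected in the previous frame
    curr_centers : (N, 2) object centers detected in the current frame
    pred_offsets : (N, 2) predicted center displacement from previous frame
    """
    projected = curr_centers - pred_offsets        # estimated previous positions
    claimed, matches = set(), []
    for i, p in enumerate(projected):
        dists = (np.linalg.norm(prev_centers - p, axis=1)
                 if len(prev_centers) else np.array([]))
        for j in claimed:                          # each prior track claimed once
            dists[j] = np.inf
        j = int(np.argmin(dists)) if dists.size else -1
        if j >= 0 and dists[j] < max_dist:
            matches.append((i, j))                 # continue existing track j
            claimed.add(j)
        else:
            matches.append((i, None))              # birth of a new track
    return matches

# Toy usage: one detection that moved by (+10, +2) from the first prior center.
prev = np.array([[100.0, 50.0], [200.0, 80.0]])
curr = np.array([[110.0, 52.0]])
off = np.array([[10.0, 2.0]])
print(associate(prev, curr, off))                  # -> [(0, 0)]
```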
TraMNet - Transition Matrix Network for Efficient Action Tube Proposals
Current state-of-the-art methods solve spatiotemporal action localisation by
extending 2D anchors to 3D-cuboid proposals on stacks of frames, to generate
sets of temporally connected bounding boxes called action micro-tubes.
However, they fail to consider that the underlying anchor proposal hypotheses
should also move (transition) from frame to frame, as the actor or the camera
does. Assuming we evaluate $n$ 2D anchors in each frame, then the number of
possible transitions from each 2D anchor to the next, for a sequence of $f$
consecutive frames, is in the order of $n^f$, expensive even for small
values of $n$. To avoid this problem, we introduce a Transition-Matrix-based
Network (TraMNet) which relies on computing transition probabilities between
anchor proposals while maximising their overlap with ground truth bounding
boxes across frames, and enforcing sparsity via a transition threshold. As the
resulting transition matrix is sparse and stochastic, this reduces the proposal
hypothesis search space from $n^f$ to the cardinality of the thresholded
matrix. At training time, transitions are specific to cell locations of the
feature maps, so that a sparse (efficient) transition matrix is used to train
the network. At test time, a denser transition matrix can be obtained either by
decreasing the threshold or by adding to it all the relative transitions
originating from any cell location, allowing the network to handle transitions
in the test data that might not have been present in the training data, and
making detection translation-invariant. Finally, we show that our network can
handle sparse annotations such as those available in the DALY dataset. We
report extensive experiments on the DALY, UCF101-24 and Transformed-UCF101-24
datasets to support our claims.
Comment: 15 pages
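As an illustration of the transition-matrix idea, the sketch below estimates a sparse, stochastic anchor-transition matrix from ground-truth anchor matches and thresholds it; the surviving entries replace the exhaustive $n^f$ hypothesis space. This is a sketch under assumed inputs, not the paper's implementation.

```python
import numpy as np

def transition_matrix(gt_anchor_pairs, n_anchors, threshold=0.05):
    """Estimate a sparse, stochastic anchor-transition matrix from observed
    ground-truth anchor matches, then threshold it to enforce sparsity.

    gt_anchor_pairs : iterable of (a_t, a_t1) index pairs, where a ground-truth
                      track matches anchor a_t in one frame and a_t1 in the next
    """
    T = np.zeros((n_anchors, n_anchors))
    for a_t, a_t1 in gt_anchor_pairs:
        T[a_t, a_t1] += 1.0
    row_sums = T.sum(axis=1, keepdims=True)    # normalise rows to probabilities
    T = np.divide(T, row_sums, out=np.zeros_like(T), where=row_sums > 0)
    T[T < threshold] = 0.0                     # drop unlikely transitions
    return T

# Toy usage: anchor 0 transitions to anchor 1 twice and stays put once.
T = transition_matrix([(0, 0), (0, 1), (0, 1), (2, 2)], n_anchors=3, threshold=0.4)
print(np.count_nonzero(T))  # surviving entries replace the exhaustive search space
```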
Automatic alignment of surgical videos using kinematic data
Over the past one hundred years, the classic teaching methodology of "see
one, do one, teach one" has governed the surgical education systems worldwide.
With the advent of the Operating Room 2.0, recording video, kinematic and many
other types of data during surgery became an easy task, thus allowing
artificial intelligence systems to be deployed and used in surgical and medical
practice. Recently, surgical videos have been shown to provide a structure for
peer coaching enabling novice trainees to learn from experienced surgeons by
replaying those videos. However, the high inter-operator variability in
surgical gesture duration and execution renders learning from comparing novice
to expert surgical videos a very difficult task. In this paper, we propose a
novel technique to align multiple videos based on the alignment of their
corresponding kinematic multivariate time series data. By leveraging the
Dynamic Time Warping measure, our algorithm synchronizes a set of videos in
order to show the same gesture being performed at different speeds. We believe
that the proposed approach is a valuable addition to the existing learning
tools for surgery.
Comment: Accepted at AIME 2019
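The alignment core here is standard Dynamic Time Warping over multivariate kinematic series. Below is a self-contained sketch (input shapes are assumptions) that returns a warping path which could be used to remap the frames of one video onto the timeline of another:

```python
import numpy as np

def dtw_path(X, Y):
    """Dynamic Time Warping between multivariate series X (n, d) and Y (m, d);
    returns the optimal warping path as (i, j) index pairs."""
    n, m = len(X), len(Y)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = np.linalg.norm(X[i - 1] - Y[j - 1])   # local Euclidean distance
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    path, i, j = [], n, m                             # backtrack from the end
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        step = int(np.argmin([cost[i - 1, j - 1], cost[i - 1, j], cost[i, j - 1]]))
        if step == 0:
            i, j = i - 1, j - 1
        elif step == 1:
            i -= 1
        else:
            j -= 1
    return path[::-1]

# Toy usage: the second series is a slowed-down copy of the first.
X = np.array([[0.0], [1.0], [2.0]])
Y = np.array([[0.0], [0.0], [1.0], [2.0], [2.0]])
print(dtw_path(X, Y))   # e.g. [(0, 0), (0, 1), (1, 2), (2, 3), (2, 4)]
```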
Asynchronous, Photometric Feature Tracking using Events and Frames
We present a method that leverages the complementarity of event cameras and
standard cameras to track visual features with low latency. Event cameras are
novel sensors that output pixel-level brightness changes, called "events". They
offer significant advantages over standard cameras, namely a very high dynamic
range, no motion blur, and a latency in the order of microseconds. However,
because the same scene pattern can produce different events depending on the
motion direction, establishing event correspondences across time is
challenging. By contrast, standard cameras provide intensity measurements
(frames) that do not depend on motion direction. Our method extracts features
on frames and subsequently tracks them asynchronously using events, thereby
exploiting the best of both types of data: the frames provide a photometric
representation that does not depend on motion direction and the events provide
low-latency updates. In contrast to previous works, which are based on
heuristics, this is the first principled method that uses raw intensity
measurements directly, based on a generative event model within a
maximum-likelihood framework. As a result, our method produces feature tracks
that are both more accurate (subpixel accuracy) and longer than the state of
the art, across a wide variety of scenes.
Comment: 22 pages, 15 figures. Video: https://youtu.be/A7UfeUnG6c
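For intuition only, here is a sketch of the linearized generative event model such methods build on, dL = -(grad L) . v dt: events accumulated over a short window predict a brightness increment that should match the one implied by the frame gradients and the patch velocity. All names are hypothetical, and the paper's actual objective is a full maximum-likelihood formulation rather than this plain least squares:

```python
import numpy as np

def increment_residual(grad_x, grad_y, event_increment, v):
    """Least-squares residual between the brightness increment accumulated
    from events and the increment predicted by the linearized event model
    dL = -(grad L) . v dt for a feature patch moving with velocity v.

    grad_x, grad_y  : (h, w) spatial gradients of the log-intensity frame patch
    event_increment : (h, w) per-pixel sum of event polarities times the
                      contrast threshold, over a short time window
    v               : (2,) candidate patch displacement over that window
    """
    predicted = -(grad_x * v[0] + grad_y * v[1])
    r = predicted - event_increment
    return 0.5 * np.sum(r ** 2)

# A crude tracker update: grid-search the displacement minimizing the residual.
gx = np.random.randn(15, 15)
gy = np.random.randn(15, 15)
true_v = np.array([0.8, -0.3])
ev = -(gx * true_v[0] + gy * true_v[1])            # simulated noiseless events
grid = [(a, b) for a in np.linspace(-1, 1, 21) for b in np.linspace(-1, 1, 21)]
best = min(grid, key=lambda v: increment_residual(gx, gy, ev, v))
print(best)                                        # recovers (0.8, -0.3)
```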
Content-Aware Unsupervised Deep Homography Estimation
Homography estimation is a basic image alignment method in many applications.
It is usually conducted by extracting and matching sparse feature points, which
are error-prone in low-light and low-texture images. On the other hand,
previous deep homography approaches use either synthetic images for supervised
learning or aerial images for unsupervised learning, both ignoring the
importance of handling depth disparities and moving objects in real world
applications. To overcome these problems, in this work we propose an
unsupervised deep homography method with a new architecture design. In the
spirit of the RANSAC procedure in traditional methods, we specifically learn an
outlier mask to only select reliable regions for homography estimation. We
calculate loss with respect to our learned deep features instead of directly
comparing image content as was done previously. To achieve the unsupervised
training, we also formulate a novel triplet loss customized for our network. We
verify our method by conducting comprehensive comparisons on a new dataset that
covers a wide range of scenes with varying degrees of difficulty for the
task. Experimental results reveal that our method outperforms the state of the
art, including both deep and feature-based solutions.
Comment: Accepted by ECCV 2020 (Oral, Top 2%, 3 over 3 Strong Accepts). Jirong Zhang and Chuan Wang are joint first authors, and Shuaicheng Liu is the corresponding author.
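A rough sketch of the masked feature loss and triplet-style comparison described in the abstract; shapes and names are assumptions, not the authors' code:

```python
import numpy as np

def masked_feature_loss(feat_a, feat_b, mask):
    """Masked L1 distance between deep feature maps: the learned inlier mask
    down-weights regions (moving objects, large depth disparity) that cannot
    be explained by a single homography.

    feat_a, feat_b : (c, h, w) feature maps
    mask           : (h, w) inlier weights in [0, 1]
    """
    per_pixel = np.abs(feat_a - feat_b).mean(axis=0)       # (h, w)
    return (mask * per_pixel).sum() / (mask.sum() + 1e-8)

def triplet_style_loss(feat_a_warped, feat_a, feat_b, mask):
    """Pull the warped image's features toward the reference while pushing the
    unwarped features away, so the network cannot cheat by collapsing features."""
    return (masked_feature_loss(feat_a_warped, feat_b, mask)
            - masked_feature_loss(feat_a, feat_b, mask))
```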
4D Match Trees for Non-rigid Surface Alignment
This paper presents a method for dense 4D temporal alignment of partial reconstructions of non-rigid surfaces observed from single or multiple moving cameras in complex scenes. 4D Match Trees are introduced for robust global alignment of non-rigid shape based on the similarity between images across sequences and views. Wide-timeframe sparse correspondence between arbitrary pairs of images is established using a segmentation-based feature detector (SFD), which is demonstrated to give improved matching of non-rigid shape. Sparse SFD correspondence allows the similarity between any pair of image frames to be estimated for moving cameras and multiple views. This enables the 4D Match Tree to be constructed which minimises the observed change in non-rigid shape for global alignment across all images. Dense 4D temporal correspondence across all frames is then estimated by traversing the 4D Match Tree using optical flow initialised from the sparse feature matches. The approach is evaluated on single- and multiple-view image sequences for alignment of partial surface reconstructions of dynamic objects in complex indoor and outdoor scenes to obtain a temporally consistent 4D representation. Comparison to previous 2D and 3D scene flow methods demonstrates that 4D Match Trees achieve reduced errors due to drift and improved robustness to large non-rigid deformations.
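The match-tree construction can be pictured as a minimum spanning tree over pairwise frame dissimilarity, traversed from a root frame. Below is a minimal sketch using Prim's algorithm, with a hypothetical dissimilarity input:

```python
import numpy as np

def match_tree(dissimilarity):
    """Spanning tree over frames minimizing total pairwise dissimilarity
    (Prim's algorithm). Correspondence is then propagated from the root along
    tree edges, so each frame is aligned via its most similar neighbour rather
    than strictly sequentially, which limits drift.

    dissimilarity : (n, n) symmetric matrix, e.g. 1 - sparse-match similarity
    Returns the root index and a list of (parent, child) edges.
    """
    n = len(dissimilarity)
    root = int(np.argmin(dissimilarity.sum(axis=1)))   # frame most similar to all
    in_tree = {root}
    edges = []
    while len(in_tree) < n:
        best = None
        for u in in_tree:                              # cheapest edge leaving tree
            for v in range(n):
                if v not in in_tree and (best is None or
                        dissimilarity[u, v] < dissimilarity[best[0], best[1]]):
                    best = (u, v)
        edges.append(best)
        in_tree.add(best[1])
    return root, edges

# Toy usage: frames 0-1 and 2-3 are similar pairs; 1-2 bridges them.
D = np.array([[0.0, 0.1, 0.9, 0.9],
              [0.1, 0.0, 0.2, 0.9],
              [0.9, 0.2, 0.0, 0.1],
              [0.9, 0.9, 0.1, 0.0]])
print(match_tree(D))   # root 1, edges (1, 0), (1, 2), (2, 3)
```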
Behavior Discovery and Alignment of Articulated Object Classes from Unstructured Video
We propose an automatic system for organizing the content of a collection of
unstructured videos of an articulated object class (e.g. tiger, horse). By
exploiting the recurring motion patterns of the class across videos, our
system: 1) identifies its characteristic behaviors; and 2) recovers
pixel-to-pixel alignments across different instances. Our system can be useful
for organizing video collections for indexing and retrieval. Moreover, it can
be a platform for learning the appearance or behaviors of object classes from
Internet video. Traditional supervised techniques cannot exploit this wealth of
data directly, as they require large amounts of time-consuming manual
annotation.
The behavior discovery stage generates temporal video intervals, each
automatically trimmed to one instance of the discovered behavior, clustered by
type. It relies on our novel motion representation for articulated motion based
on the displacement of ordered pairs of trajectories (PoTs). The alignment
stage aligns hundreds of instances of the class with great accuracy despite
considerable appearance variations (e.g. an adult tiger and a cub). It uses a
flexible Thin Plate Spline deformation model that can vary through time. We
carefully evaluate each step of our system on a new, fully annotated dataset.
On behavior discovery, we outperform the state-of-the-art Improved DTF
descriptor. On spatial alignment, we outperform the popular SIFT Flow
algorithm.
Comment: 19 pages, 19 figures, 3 tables. arXiv admin note: substantial text overlap with arXiv:1411.788
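A toy version of the pairs-of-trajectories (PoT) idea, encoding how the relative displacement between two tracked points evolves over time; this is a sketch of the concept, not the paper's exact descriptor:

```python
import numpy as np

def pot_descriptor(traj_a, traj_b):
    """Encode how the relative displacement between two tracked points evolves
    over time; this captures articulated motion (e.g. a leg swinging against
    the torso) while being invariant to global translation.

    traj_a, traj_b : (t, 2) arrays of 2D point positions over t frames
    """
    rel = traj_b - traj_a                      # relative vector per frame
    drel = np.diff(rel, axis=0)                # change in relative displacement
    mag = np.linalg.norm(drel, axis=1)         # how fast the pair deforms
    ang = np.arctan2(drel[:, 1], drel[:, 0])   # in which direction
    return np.concatenate([mag, ang])

# Toy usage: point b rotates around a static point a.
t = np.linspace(0, np.pi, 10)
a = np.zeros((10, 2))
b = np.stack([np.cos(t), np.sin(t)], axis=1)
print(pot_descriptor(a, b).shape)              # (18,)
```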
Jitter-free registration for Unmanned Aerial Vehicle Videos
Unmanned Aerial Vehicles (UAVs), such as tethered drones, have become increasingly popular for video acquisition in video surveillance and remote scientific measurement contexts. However, UAV recordings often present an unstable, variable viewpoint that is detrimental to the automatic exploitation of their content. This is often countered by one of two strategies, video registration and video stabilization, which suffer from distinct issues, namely jitter and drifting. This paper proposes a hybrid solution between both techniques that produces a jitter-free registration. A lightweight implementation enables real-time, automatic generation of videos with a constant viewpoint from unstable video sequences acquired with stationary UAVs. Performance evaluation is carried out using video recordings of traffic surveillance scenes up to 15 minutes long, including multiple mobile objects.
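The drift-free half of such a hybrid scheme amounts to registering every frame against one fixed reference viewpoint. Here is a sketch using standard OpenCV calls; parameter values are arbitrary choices, not the paper's:

```python
import cv2
import numpy as np

orb = cv2.ORB_create(nfeatures=1000)
matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)

def register_to_reference(ref_gray, frame_gray):
    """Warp a frame to the viewpoint of a single fixed reference frame.
    Because every frame is registered to the same reference, alignment errors
    cannot accumulate over time (no drift)."""
    kp_r, des_r = orb.detectAndCompute(ref_gray, None)
    kp_f, des_f = orb.detectAndCompute(frame_gray, None)
    if des_r is None or des_f is None:
        return None                            # not enough texture to register
    matches = matcher.match(des_f, des_r)
    if len(matches) < 4:
        return None                            # homography needs >= 4 matches
    src = np.float32([kp_f[m.queryIdx].pt for m in matches]).reshape(-1, 1, 2)
    dst = np.float32([kp_r[m.trainIdx].pt for m in matches]).reshape(-1, 1, 2)
    H, _ = cv2.findHomography(src, dst, cv2.RANSAC, 3.0)   # robust to outliers
    if H is None:
        return None
    h, w = ref_gray.shape
    return cv2.warpPerspective(frame_gray, H, (w, h))
```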